Sampling for database systems

ABSTRACT

A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records, such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.

TECHNICAL FIELD

[0001] The present invention relates generally to the field of databasesystems. More particularly, the present invention relates to the fieldof sampling records in a database system.

BACKGROUND OF THE INVENTION

[0002] Computer database systems manage the storage and retrieval ofdata in a database. A database comprises a set of tables of data alongwith information about relations between the tables. Tables representrelations over the data. Each table comprises a set of records or tuplesof data stored in one or more data fields. The records of a table arealso referred to as rows, and the data fields of records in a table arealso referred to as columns. A database server processes datamanipulation statements or queries, for example, to retrieve, insert,delete, and update data in a database. Queries are defined by a querylanguage supported by the database system.

[0003] For large databases such as data warehouses, for example, typicaltools such as On Line Analytical Processing (OLAP) and data mining serveas middleware or application servers that communicate data retrievalrequests to a backend database system through a query. Although the costof executing ad-hoc queries against the backend can be expensive, manydata mining applications and statistical analysis techniques can use asample of the data requested through the query. Similarly, OLAP serversthat answer queries involving aggregation (e.g., “find total sales forall products in the NorthWest region between Jan. 1, 1998 and Jan. 15,1998”) benefit from the ability to present to the user an approximateanswer computed from a sample of the result of the query posed to thedatabase.

[0004] Sampling is preferably supported not only on existing stored orbase relations but also on relations produced as a result of anarbitrary query. Sampling may be supported in relational databases as aprimitive operation SAMPLE(R,f), for example, to produce a sample S of rtuples that is an f-fraction of a relation R. Fully evaluating a query Qto compute relation R only to discard most of relation R when applyingSAMPLE(R,f), however, is inefficient. Preferably, query Q may bepartially evaluated so as to produce only sample S of relation R.

[0005] For a given query tree T for computing a relation R that is theresult of a query Q where SAMPLE(R,f) is the root or last operation ofquery tree T, pushing the sample operation down tree T toward its leaveswould help minimize the cost of evaluating query Q as only a smallfraction of stored and/or intermediate relations would be considered inevaluating query Q. The ability to commute the sample operation in thismanner, however, depends on the relational operations used in query treeT. The standard relational operation of selection can be freelyinterchanged with sampling. With join operations, however, sampling maynot be so easily commuted.

[0006]FIG. 1 illustrates a query tree 100 for obtaining a sample of ajoin of operand relations R₁ and R₂. Query tree 100 is executed inaccordance with a flow diagram 200 of FIG. 2. For step 202 of FIG. 2, arelation J is computed by joining R₁ and R₂, or J=R₁

R₂. For step 204, r tuples are randomly sampled from relation J toproduce a sample relation S. Commuting the sample operation in querytree 100 to operand relations R₁ and R₂, as illustrated by a query tree300 in FIG. 3, would minimize the cost of obtaining a join samplebecause only samples of operand relations R₁ and R₂ would need to bejoined. A join of samples of operand relations R₁ and R₂, however, willnot likely give a random sample of the join of operand relations R₁ andR₂.

[0007] As one example:

R₁(A,B)={(a₁,b₀), (a₂,b₁), (a₂,b₂), (a₂,b₃), . . . , (a₂,b_(n))}

[0008] and

R₂(A,C)={(a₂,c₀), (a₁,c₁), (a₁,c₂), (a₁,c₃), . . . , (a₁,c_(n))}.

[0009] That is, relation R₁ is defined over attributes A and B. Amongthe n+1 tuples of relation R₁, one tuple has an A-value a₁ and n tupleshave an A-value a₂, but all n+1 tuples of relation R₁ have distinctB-values. Similarly, relation R₂ is defined over attributes A and C.Among the n+1 tuples of relation R₂, n tuples have an A-value a₁ and onetuple has an A-value a₂, but all n+1 tuples of relation R₂ have distinctC-values.

[0010] Computing the equi-join of relations R₁ and R₂ over attribute Aproduces the following relation:

J=R₁

R₂={(a₁,b₀,c₁), (a₁,b₀,c₂), (a₁,b₀,c₃), . . . , (a₁,b₀,c_(n)),(a₂,b₁,c₀), (a₂,b₂,c₀), (a₂,b₃,c₀), . . . , (a₂,b_(n),c₀)}.

[0011] That is, relation J has n tuples with A-value a₁ and n tupleswith A-value a₂.

[0012] About one half of the tuples in a random sample S of relation J,or S⊂J, would likely have an A-value of a₁ while the remaining tupleswould have an A-value of a₂. A random sample S₁ of relation R₁, or S₁⊂R₁, however, would not likely comprise tuple (a₁,b₀), and a randomsample S₂ of relation R₂, or S₂ ⊂R₂, would not likely comprise tuple(a₂,c₀). The join of samples S₁ and S₂ would then likely comprise notuples and therefore would not likely give random sample S of relationJ.

[0013] One prior sampling strategy for obtaining a sample S of a join oftwo relations R₁ and R₂ with respect to a join attribute A isillustrated as a flow diagram 400 in FIG. 4.

[0014] For notational purposes, relations R₁ and R₂ have sizes n₁ andn₂, respectively. The domain of join attribute A is denoted by D. Foreach value v of domain D, or v ∈ D, m₁ (v) and m₂(v) denote the numberof distinct tuples in relations R₁ and R₂, respectively, that containvalue v in attribute A. Then, Σ_(v∈D) m₁(v)=n₁ and Σ_(v∈D) m₂(v)=n₂. Arelation J results from the computation of the join of relations R₁ andR₂, or J=R₁

R₂, and n is the size of relation J, or n=|J|=|R₁

R₂|. Then, n=Σ_(v∈D) m₁(v)m₂(v). For each tuple t of relation R₁, theset of tuples in relation R₂ that join with tuple t is denoted asJ_(t)(R₂)={t′∈ R₂|t′.A=t.A}; t

R₂ denotes the set of tuples in R₁

R₂ obtained by joining tuple t with the tuples in J_(t)(R₂); and |t

R₂|=|J_(t)(R₂) |=m₂(t.A). Similarly for each tuple t of relation R₂,J_(t)(R₁)={t′∈ R₁|t′.A=t.A}; R₁

t denotes the set of tuples in R₁

R₂ obtained by joining tuples in J_(t)(R₁) with tuple t; and |R₁

t|=|J_(t)(R₁)|=m₁(t.A).

[0015] For step 402 of FIG. 4, a variable r is initialized to the sizeof a sample relation S to be obtained from the join of relations R₁ andR₂. For step 404, a variable M is initialized to the upper bound on thenumber of join attribute values v in relation R₂ for all values v ofdomain D on attribute A. That is, M is the maximum number of any onejoin attribute value in relation R₂. A tuple t₁ is randomly sampled fromrelation R₁ for step 406. A tuple t₂ is then randomly sampled for step408 from among all tuples of relation R₂ having a join attribute valuet₂.A that matches the join attribute value t₁.A of tuple t₁. For step410, a tuple T is computed as T=t₁

t₂ and output for sample relation S with a probability based on thenumber of tuples in relation R₂ having a join attribute value thatmatches that of tuple t₁ divided by M, or m₂(t₂.A)/M. If not output, thesample tuple t₁ is rejected for step 410. If r tuples have not yet beenoutput for sample relation S as determined for step 412, steps 406through steps 412 are then repeated until r tuples have been output toform sample relation S as determined for step 412. Flow diagram 400 thenends for step 414.

[0016] The sampling technique of FIG. 4 in practice, however, requiresindexes for random access to relations R₁ and R₂, noting relation R₁must be materialized for proper sampling because the rejection of tuplesfor step 410 requires that the number of samples from relation R₁ be arandom variable having a distribution dependent upon the distribution ofjoin attribute values in relation R₂. This strategy therefore haslimited applicability in commuting sampling with joins involvingintermediate relations that are produced as a result of an arbitraryquery in a query tree and that are not materialized and indexed.

[0017] The ability to sample tuples produced as a stream, that is toperform sequential sampling, is significant not only becauseintermediate relations produced by a pipeline, such as in a query treefor example, may be sampled without materialization but also because arelation, whether materialized or not, may be sampled in a single pass.How and whether sequential sampling may be performed, however, maydepend on the chosen semantics for the sampling.

[0018] The tuples of a relation may be sampled, for example, using withreplacement (WR), without replacement (WoR), or independent coin flips(CF) semantics.

[0019] For WR sampling of an f-fraction of the n tuples in a relation R,each sampled tuple is chosen uniformly and independently from among alltuples in relation R, noting any one tuple could be sampled multipletimes. The sample is a bag or multiset of f*n tuples from relation R.

[0020] For WoR sampling an f-fraction of the n tuples in a relation R,f*n distinct tuples are sampled from relation R, noting each successivesampled tuple is chosen uniformly from the set of tuples not yetsampled. The sample is a set of f*n distinct tuples from relation R.

[0021] For CF sampling an f-fraction of the n tuples in a relation R,each tuple in relation R is chosen for the sample with probability f,independent of other tuples. Sampling in this manner is analogous toflipping a coin with bias f for each tuple in turn. The sample is a setof X distinct tuples from relation R, where X is a random variable withthe binomial distribution B(n,f) and has expectation f*n. The binomialdistribution B(n,f) is, in effect, the distribution of a random valuegenerated by counting the total number of heads when flipping nindependent coins, each of which has a probability f of being heads.Sampling using independent coin flip semantics is also called binomialsampling.

[0022] The sampling of a relation may also be weighted or unweighted.For unweighted sampling, each element is sampled uniformly at random.For weighted sampling, each element is sampled with a probabilityproportional to its weight for some pre-specified set of weights.

[0023] One prior sequential sampling technique uses CF semantics bysampling each passing tuple of a relation R with probability f for headsand adds the tuple to a sample S if the probability is satisfied.Another prior sequential sampling technique uses WoR semantics byinitializing a list or reservoir of r tuples with the first r tuples ofrelation R and repeatedly removing random tuples from the list whileadding tuples from relation R to the end of the list to produce a sampleS. Each of these techniques do not require the size of relation R inadvance and may therefore be used for sampling relations that are notmaterialized. Each of these techniques also preserve sortedness byproducing a sample of tuples in the same relative order as in relationR. The reservoir sampling technique, however, does not produce asequential output of tuples as no tuples are output until the techniquehas terminated. In the case of scanning a materialized relation on adisk, however, the reservoir sampling technique may be efficient byreading only those tuples to be entered in the reservoir by generatingrandom intervals of tuples to be skipped.

SUMMARY OF THE INVENTION

[0024] A sample operator for obtaining a sample of a plurality ofrecords in a database system has the plurality of records and samplingsemantics as parameters. The sampling semantics may be with replacement,without replacement, or coin flip sampling semantics. The sampleoperator may also have a size of the sample as a parameter and/or aweight function as a parameter to specify a sampling weight for eachrecord.

[0025] Another sample operator for obtaining a sample of a plurality ofrecords in a database system has the plurality of records as a parameterand a weight function as a parameter to specify a sampling weight foreach record. The sample operator may also have a size of the sample as aparameter.

[0026] A method obtains a sample from a plurality of records in adatabase system. The method may be implemented by computer-executableinstructions of a computer readable medium.

[0027] For the method, the plurality of records and sampling semanticsare identified from parameters of a sample operator, and a sample isobtained from the identified plurality of records using the identifiedsampling semantics. The identified sampling semantics may be withreplacement, without replacement, or coin flip sampling semantics. Asize of the sample to be obtained may be identified from a parameter ofthe sample operator, and the sample may be obtained from the identifiedplurality of records based on the identified sample size. A weightfunction may be identified from a parameter of the sample operator tospecify a weight for each record, and the sample may be obtained fromthe identified plurality of records based on the specified weight ofeach record.

[0028] The sample may be obtained by obtaining one record from theplurality of records, selectively outputting the one record one or moretimes based on a probability, and repeating these steps for one or moreother records of the plurality of records to obtain the sample. Thesample may also be obtained by obtaining one record from the pluralityof records, selectively resetting one or more records of a reservoir tobe the one record based on a probability, and repeating these steps forother records of the plurality of records such that the records of thereservoir form the sample.

[0029] Another method obtains a sample from a plurality of records in adatabase system. The method may be implemented by computer-executableinstructions of a computer readable medium.

[0030] For the method, the plurality of records and a weight functionare identified from parameters of a sample operator, wherein the weightfunction specifies a weight for each record, and a sample is obtainedfrom the identified plurality of records based on the specified weightof each record. A size of the sample to be obtained may be identifiedfrom a parameter of the sample operator, and the sample may be obtainedfrom the identified plurality of records based on the identified samplesize.

[0031] The sample may be obtained by obtaining one record from theplurality of records and the weight specified for the one record,selectively outputting the one record one or more times based on theweight specified for the one record, and repeating these steps for oneor more other records of the plurality of records to obtain the sample.The sample may also be obtained by obtaining one record from theplurality of records and the weight specified for the one record,selectively resetting one or more records of a reservoir to be the onerecord based on the weight specified for the one record, and repeatingthese steps for other records of the plurality of records such that therecords of the reservoir form the sample.

[0032] Another method performs a sequential sampling of records in onepass in a database system. The method may be implemented bycomputer-executable instructions of a computer readable medium. Thedatabase system may perform the method with suitable means.

[0033] For the method, one record from a plurality of records isobtained and selectively output one or more times based on aprobability. The plurality of records may be a relation produced as astream of records as a result of a query or may be materialized as abase relation in a database of the database system.

[0034] The one record may be selectively output by determining a randomnumber based on the probability such that the random number is greaterthan or equal to zero and outputting the one record the determinedrandom number of times. The random number may be determined from abinomial distribution based on the probability. The random number may bedetermined based on a probability based on a number of record(s) of theplurality of records to be evaluated for output, based on a probabilitybased on a weight of the one record divided by a sum of weight(s) ofrecord(s) of the plurality of records to be evaluated for output, orbased on a probability based on a fraction of the plurality of records.The random number may be determined such that the random number is lessthan or equal to a number of record(s) remaining to be output for thesample or such that the random number is less than or equal to a weightof the one record.

[0035] The one record may be selectively output one or more times basedon a weight specified for the one record. The one record may beselectively output based on a probability a number of time(s) equal innumber to the weight of the one record, and that probability may bebased on a number of record(s) remaining to be output for the sampledivided by a number of possible record(s) that may be output.

[0036] These steps are repeated for one or more other records of theplurality of records to form a sample of the plurality of records,wherein at least one obtained record may be output more than one time.The plurality of records may form a relation, and the sample may bejoined with records of another relation.

[0037] Another method performs a sequential sampling of records in onepass in a database system. The method may be implemented bycomputer-executable instructions of a computer readable medium. Thedatabase system may perform the method with suitable means.

[0038] For the method, one record from a plurality of records isobtained, and one or more records of a reservoir are selectively resetto be the one record based on a probability.

[0039] Each record of the reservoir may be selectively reset to be theone record based on a probability. One or more records of the reservoirmay be selectively reset to be the one record with a probability basedon a number of record(s) that have been obtained. One or more records ofthe reservoir may be selectively reset to be the one record based on aweight of the one record. One or more records of the reservoir may beselectively reset to be the one record with a probability based on aweight of the one record divided by a sum of weight(s) of record(s) thathave been obtained.

[0040] A random record of the reservoir may be selectively reset to bethe one record based on a probability a number of time(s) equal innumber to the weight of the one record. A random record of the reservoirmay be selectively reset to be the one record with a probability basedon a number of records in the reservoir divided by a sum of record(s)evaluated for reset in the reservoir.

[0041] These steps are repeated for other records of the plurality ofrecords such that the records of the reservoir form a sample of theplurality of records, wherein at least one obtained record may be usedto reset more than one record of the reservoir. The plurality of recordsmay form a relation, and the sample may be joined with records ofanother relation.

BRIEF DESCRIPTION OF THE DRAWINGS

[0042] The present invention is illustrated by way of example and notlimitation in the figures of the accompanying drawings, in which likereferences indicate similar elements and in which:

[0043]FIG. 1 illustrates a join query tree with a sample root operation;

[0044]FIG. 2 illustrates one prior art flow diagram for obtaining asample over a join operation;

[0045]FIG. 3 illustrates a join query tree with the sample operationcommuted to the operand relations of the join operation;

[0046]FIG. 4 illustrates another prior art flow diagram for obtaining asample over a join operation;

[0047]FIG. 5 illustrates an exemplary operating environment for samplingof records and sampling over join operations;

[0048]FIG. 6 illustrates an exemplary database system for sampling ofrecords and sampling over join operations;

[0049]FIG. 7 illustrates a flow diagram for one example of an unweightedsequential with replacement (WR) sampling technique;

[0050]FIG. 8 illustrates a flow diagram for another example of anunweighted sequential WR sampling technique;

[0051]FIG. 9 illustrates a flow diagram for one example of a weightedsequential WR sampling technique;

[0052]FIG. 10 illustrates a flow diagram for another example of aweighted sequential WR sampling technique;

[0053]FIG. 11 illustrates a flow diagram for one example of a weightedsequential without replacement (WoR) sampling technique;

[0054]FIG. 12 illustrates a flow diagram for another example of aweighted sequential WoR sampling technique;

[0055]FIG. 13 illustrates a flow diagram for one example of a weightedsequential independent coin flips (CF) sampling technique;

[0056]FIG. 14 illustrates a flow diagram for one example of a streamsampling technique over join operations;

[0057]FIG. 15 illustrates a flow diagram for one example of a groupsampling technique over join operations;

[0058]FIG. 16 illustrates a flow diagram for one example of a countsampling technique over join operations;

[0059]FIG. 17 illustrates a flow diagram for one example of a frequencypartition sampling technique over join operations; and

[0060]FIG. 18 illustrates a block diagram for one example of thefrequency partition sequential sampling technique of FIG. 17.

DETAILED DESCRIPTION OF THE INVENTION

[0061] A database server supports weighted and unweighted sampling ofrecords or tuples in accordance with desired sampling semantics such aswith replacement (WR), without replacement (WoR), or independent coinflips (CF) semantics, for example. The database server may perform suchsampling sequentially not only to sample non-materialized records, suchas those produced as a stream by a pipeline in a query tree for example,but also to sample records, whether materialized or not, in a singlepass. The database server also supports sampling over a join of tworelations of records or tuples without requiring the computation of thefull join and without requiring the materialization of both relationsand/or indexes on the join attribute values of both relations.

[0062] EXEMPLARY OPERATING ENVIRONMENT

[0063]FIG. 5 and the following discussion are intended to provide abrief, general description of a suitable computing environment in whichthe invention may be implemented. Although not required, the inventionwill be described in the general context of computer-executableinstructions, such as program modules, being executed by a personalcomputer. Generally, program modules include routines, programs,objects, components, data structures, etc., that perform particulartasks or implement particular abstract data types. Moreover, thoseskilled in the art will appreciate that the invention may be practicedwith other computer system configurations, including hand-held devices,multiprocessor systems, microprocessor-based or programmable consumerelectronics, network PCs, minicomputers, mainframe computers, and thelike. The invention may also be practiced in distributed computingenvironments where tasks are performed by remote processing devices thatare linked through a communications network. In a distributed computingenvironment, program modules may be located in both local and remotememory storage devices.

[0064] With reference to FIG. 5, an exemplary system for implementingthe invention includes a general purpose computing device in the form ofa conventional personal computer 520, including a processing unit 521, asystem memory 522, and a system bus 523 that couples various systemcomponents including system memory 522 to processing unit 521. Systembus 523 may be any of several types of bus structures including a memorybus or memory controller, a peripheral bus, and a local bus using any ofa variety of bus architectures. System memory 522 includes read onlymemory (ROM) 524 and random access memory (RAM) 525. A basicinput/output system (BIOS) 526, containing the basic routines that helpto transfer information between elements within personal computer 520,such as during start-up, is stored in ROM 524. Personal computer 520further includes a hard disk drive 527 for reading from and writing to ahard disk, a magnetic disk drive 528 for reading from or writing to aremovable magnetic disk 529, and an optical disk drive 530 for readingfrom or writing to a removable optical disk 531 such as a CD ROM orother optical media. Hard disk drive 527, magnetic disk drive 528, andoptical disk drive 530 are connected to system bus 523 by a hard diskdrive interface 532, a magnetic disk drive interface 533, and an opticaldrive interface 534, respectively. The drives and their associatedcomputer-readable media provide nonvolatile storage of computer-readableinstructions, data structures, program modules and other data forpersonal computer 520. Although the exemplary environment describedherein employs a hard disk, a removable magnetic disk 529 and aremovable optical disk 531, it should be appreciated by those skilled inthe art that other types of computer-readable media which can store datathat is accessible by a computer, such as magnetic cassettes, flashmemory cards, digital video disks, Bernoulli cartridges, random accessmemories (RAMs), read only memories (ROMs), and the like, may also beused in the exemplary operating environment.

[0065] A number of program modules may be stored on the hard disk,magnetic disk 529, optical disk 531, ROM 524 or RAM 525, including anoperating system 535, one or more application programs 536, otherprogram modules 537, and program data 538. A user may enter commands andinformation into personal computer 520 through input devices such as akeyboard 540 and pointing device 542. Other input devices may include amicrophone, joystick, game pad, satellite dish, scanner, or the like.These and other input devices are often connected to processing unit 521through a serial port interface 546 that is coupled to system bus 523,but may be connected by other interfaces, such as a parallel port, gameport or a universal serial bus (USB). A monitor 547 or other type ofdisplay device is also connected to system bus 523 via an interface,such as a video adapter 548. In addition to the monitor, personalcomputers typically include other peripheral output devices, such asspeakers and printers.

[0066] Personal computer 520 may operate in a networked environmentusing logical connections to one or more remote computers, such as aremote computer 549. Remote computer 549 may be another personalcomputer, a server, a router, a network PC, a peer device or othercommon network node, and typically includes many or all of the elementsdescribed above relative to personal computer 520, although only amemory storage device 550 has been illustrated in FIG. 5. The logicalconnections depicted in FIG. 5 include a local area network (LAN) 551and a wide area network (WAN) 552. Such networking environments arecommonplace in offices, enterprise-wide computer networks, intranets andthe Internet.

[0067] When used in a LAN networking environment, personal computer 520is connected to local network 551 through a network interface or adapter553. When used in a WAN networking environment, personal computer 520typically includes a modem 554 or other means for establishingcommunications over wide area network 552, such as the Internet. Modem554, which may be internal or external, is connected to system bus 523via serial port interface 546. In a networked environment, programmodules depicted relative to personal computer 520, or portions thereof,may be stored in remote memory storage device 550. It will beappreciated that the network connections shown are exemplary and othermeans of establishing a communications link between the computers may beused.

[0068] DATABASE SYSTEM

[0069]FIG. 6 illustrates one example of a computer database system 600for performing sampling of records and sampling over join operations.Database system 600 comprises a database 610, a database server 620, anda client tool 630. Database system 600 manages the storage and retrievalof data in database 610 in accordance with data manipulation statementsor queries presented to database server 620 by client tool 630, forexample. Client tool 630 may be, for example, a database application, auser interface tool, On Line Analytical Processing (OLAP) or data miningmiddleware, or an OLAP or data mining application server.

[0070] Database 610 comprises a set of tables of data along withinformation about relations between the tables. Tables representrelations over the data. Each table comprises a set of records or tuplesof data stored in one or more data fields. The records of a table arealso referred to as rows, and the data fields of records in a table arealso referred to as columns.

[0071] Database server 620 processes queries, for example, to retrieve,insert, delete, and/or update data in database 610. Database system 600may support any suitable query language, such as Structured QueryLanguage (SQL) for example, to define the queries that may be processedby database server 620. Suitable SQL queries include, for example,Select, Insert, Delete, and Update statements. Database server 620 forone embodiment comprises the Microsoft® SQL Server.

[0072] To enhance performance in processing queries, database server 620uses indexes to help access data in database 610 more efficiently.Database server 620 also generates efficient execution plans for querieswith respect to a set of indexes. In generating execution plans,database server 620 may use statistics such as histograms on thecolumn(s) of the tables referenced in a query. Database server 620 maystore such indexes and statistics in a catalog table 622, for example.

[0073] Because the cost of executing ad-hoc queries against database 610can be expensive, for example if database 610 is a large database suchas a data warehouse, database server 620 may sample not only existingbase relations stored in database 610 but also relations produced as aresult of an arbitrary query. Database server 620 may, for example,sample intermediate relations produced as a result of a query in a querytree.

[0074] Database server 620 supports weighted and unweighted sampling ofrecords or tuples in accordance with desired sampling semantics such aswith replacement (WR), without replacement (WoR), or independent coinflips (CF) semantics, for example. Database server 620 may perform suchsampling sequentially not only to sample non-materialized records, suchas those produced as a stream by a pipeline in a query tree for example,but also to sample records, whether materialized or not, in a singlepass. Database server 620 also supports sampling over a join of tworelations R₁ and R₂ without requiring the computation of the full joinJ=R₁

R₂ and without requiring the materialization of both relations R₁ and R₂and/or indexes on the join attribute values of both relations R₁ and R₂.

[0075] Database server 620 and client tool 630 are implemented for oneexample as program modules or computer-executable instructions and maybe stored on any suitable computer-readable medium for execution in asuitable operating environment, such as the computing environment ofFIG. 5 for example. The data of database 610 and catalog table 622 maybe implemented in the form of one or more suitable data structures andmay also be stored on any suitable computer-readable medium.

[0076] SAMPLE OPERATOR

[0077] Database server 620 supports a sample operator that may beinserted into a query execution plan or tree to obtain a weighted orunweighted sample S from a relation R using designated samplingsemantics.

[0078] The sample operator may have as parameters a relation R fromwhich a sample S is to be obtained, the size of sample S, the samplingsemantics to be used in obtaining sample S, and a weight function. Oneexemplary sample operator is SAMPLE(relation R, sample size, semantics,weight function). The sample size, sampling semantics, and/or weightfunction parameters may optionally be set or predetermined by thesampling technique performed by database server 620 in performing thesample operator.

[0079] The sample size parameter may be expressed as a number r oftuples to be obtained from relation R for sample S or as a fraction f ofthe n tuples of relation R to be obtained for sample S. Whether thenumber r or fraction f is to be designated may depend on the samplingtechnique performed by database server 620 in performing the sampleoperator. Where the number n of tuples of relation R is known or may bedetermined, however, database server 620 may determine the fraction ffor a designated number r and may also determine the number r for adesignated fraction f.

[0080] The sampling semantics parameter may identify any suitablesampling semantics such as, for example, with replacement (WR), withoutreplacement (WoR), or independent coin flips (CF) semantics.

[0081] The weight function parameter may designate unweighted sampling,for example by specifying the same weight, such as one for example, foreach tuple t of relation R, or may designate weighted sampling byspecifying how each tuple t of relation R is to be weighted.

[0082] Database server 620 may perform any suitable sampling techniquein performing the sample operator. For sampling of materializedrelations, database server 620 may perform any suitable non-sequentialsampling technique, including any suitable known unweightednonsequential sampling technique. Database server 620 may also performany suitable sequential sampling technique, including any suitable knownunweighted sequential WoR or CF sampling technique, not only to samplenon-materialized relations, such as those produced by a pipeline in aquery tree for example, but also to sample any relation, whethermaterialized or not, in a single pass.

[0083] UNWEIGHTED SEQUENTIAL WR SAMPLING

[0084] To obtain an unweighted with replacement (WR) sample S of anf-fraction or r tuples from a relation R of n tuples, for example inperforming a sample operator such as SAMPLE(R, r, WR, w(t)) where w(t)returns a constant for each tuple t of relation R for unweightedsampling, database server 620 uniformly and independently chooses eachtuple of sample S at random from relation R.

[0085] Database server 620 may perform unweighted sequential WR samplingin one pass by obtaining a tuple t from relation R, selectivelyoutputting one or more copies of tuple t based on a probability, andrepeating this technique for other tuples of relation R to form sampleS. As one example, database server 620 may perform unweighted sequentialWR sampling in accordance with a flow diagram 700 of FIG. 7.

[0086] For step 702 of FIG. 7, a variable r is initialized to the sizeof sample relation S to be obtained from relation R. A variable n isinitialized to the number of tuples in relation R for step 704. For step706, the variable ρ is initialized to r. The variable ρ maintains acount of the number of tuples remaining to be output for sample S. Forstep 708, a variable i is initialized to zero. The variable i maintainsa count of the number of tuples that have been obtained from relation Rfor evaluation.

[0087] Each tuple of relation R is evaluated for steps 710, 712, 714,716, 718, and 720 to determine whether any and, if so, how many copiesof that tuple are to be output for sample S.

[0088] For step 710, a next tuple t in relation R is obtained forevaluation, and the variable i is incremented by one for step 712. Forstep 714, a random variable X distributed as a binomial distribution isdetermined. An exemplary binomial distribution has as parameters thenumber ρ of tuples yet to be obtained for sample S and a probabilitybased on the number of tuples from relation R that have yet to beevaluated, such as B(ρ,1/(n−i+1)) for example. Noting the randomvariable X is an integer greater than or equal to zero and less than orequal to ρ, X copies of tuple t are output for sample S for step 716.Although illustrated as being distributed in accordance with binomialdistribution B(ρ,1/(n−i+1)) for random sampling, the variable X may bedetermined in accordance with any suitable probability, such as forbiased sampling for example. The number ρ of tuples yet to be obtainedfor sample S is accordingly reduced by X for step 718 and compared tozero for step 720.

[0089] If ρ is greater than zero for step 720, steps 710-720 arerepeated until all tuples of relation R have been evaluated asdetermined for step 710 or until the number ρ of tuples yet to beobtained for sample S has been reduced to zero as determined for step720. Flow diagram 700 then ends for step 722.

[0090] Sampling tuples in this manner advantageously produces the rtuples of sample S in the same relative order as in relation R and usesminimal auxiliary memory as only one tuple of relation R is stored at atime.

[0091] Although flow diagram 700 of FIG. 7 requires the size n ofrelation R, relation R may be materialized in which case the size of therelation may be already known or determined. Relation R may have alsobeen produced as a sample of another relation in which case the size ofrelation R may have been specified for relation R or determined inproducing relation R. The size n of relation R may optionally bedesignated as a parameter for a sample operator.

[0092] The sampling technique of FIG. 7 may be used for block-levelsampling of a relation materialized on a disk, for example, and may beused to skip over a random set of tuples, that is those tuples for whichthe random variable X is zero.

[0093] Database server 620 may also perform unweighted sequential WRsampling in one pass by obtaining a tuple t from relation R, selectivelyresetting one or more tuples of a reservoir to be tuple t based on aprobability, and repeating this technique for other tuples of relation Rto form sample S. As one example, database server 620 may performunweighted sequential WR sampling in accordance with a flow diagram 800of FIG. 8.

[0094] For step 802 of FIG. 8, a variable r is initialized to the sizeof sample relation S to be obtained from relation R. For step 804, avariable i is initialized to zero. The variable i maintains a count ofthe number of tuples that have been obtained from relation R forevaluation. For step 806, a reservoir array A[1. . . r] of r dummytuples is initialized. The r tuples of the reservoir array are resetwith tuples from relation R to obtain sample S.

[0095] Each tuple of relation R is evaluated for steps 808, 810, 812,814, 816, and 818 to determine whether any one or more of the tuples ofthe reservoir array are to be reset to that tuple from relation R.

[0096] For step 808, a next tuple t in relation R is obtained forevaluation, and the variable i is incremented by one for step 810. Forstep 812, a variable j is initialized to one. The variable j is used toindex each tuple in the reservoir array. For step 814, the jth tuple ofthe reservoir array is reset to be tuple t with a probability based onthe number of tuples that have been obtained from relation R forevaluation, such as 1/i for example. The variable j is incremented byone for step 816 and compared to r for step 818. If the variable j isless than or equal to r for step 818, steps 814-818 are repeated so theother tuples of the reservoir array may be possibly reset to be tuple t.When the variable j is greater than r, steps 808-818 are repeated untilall tuples of relation R have been evaluated as determined for step 808.The r tuples of the reservoir array are then output as sample S for step820.

[0097] Although the r tuples of sample S are not made available untileach tuple of relation R has been evaluated, sampling tuples in thismanner does not require the size n of relation R and may therefore beused to sample tuples from intermediate relations, for example, ofunknown size. Sampling tuples in this manner also uses minimal auxiliarymemory or storage as only r tuples of relation R are stored at a time.

[0098] The sampling technique of FIG. 8 may also be used for block-levelsampling of a relation materialized on a disk, for example, and may beused to skip over a random set of tuples.

[0099] WEIGHTED SEQUENTIAL WR SAMPLING

[0100] To obtain a weighted WR sample S of an f-fraction or r tuplesfrom a relation R of n tuples with each tuple t of relation R having aspecified weight w(t), for example in performing a sample operator suchas SAMPLE(R, r, WR, w(t)), database server 620 independently chooseseach tuple t of sample S at random from relation R with a probabilitybased on weight w(t) of tuple t.

[0101] Database server 620 may perform weighted sequential WR samplingin one pass by obtaining a tuple t from relation R and its specifiedweight w(t), selectively outputting one or more copies of tuple t basedon its weight w(t) and based on a probability, and repeating thistechnique for other tuples of relation R to form sample S. As oneexample, database server 620 may perform weighted sequential WR samplingin accordance with a flow diagram 900 of FIG. 9.

[0102] For step 902 of FIG. 9, a variable r is initialized to the sizeof sample relation S to be obtained from relation R. For step 904, avariable D is initialized to zero. The variable D maintains a sum of theweights of the tuples that have been obtained from relation R forevaluation. For step 906, the variable ρ is initialized to r. Thevariable ρ maintains a count of the number of tuples remaining to beoutput for sample S. For step 908, a variable W is initialized to thesum of the weights of all n tuples of relation R.

[0103] Each tuple of relation R is evaluated for steps 910, 912, 914,916, 918, and 920 to determine whether any and, if so, how many copiesof that tuple are to be output for sample S.

[0104] For step 910, a next tuple t in relation R is obtained with itsweight w(t) for evaluation, and the variable D is increased by theweight w(t) for step 912. For step 914, a random variable X distributedas a binomial distribution is determined. An exemplary binomialdistribution has as parameters the number p of tuples yet to be obtainedfor sample S and a probability based on weight w(t) of tuple t and thesum of the weight of the tuples that have yet to be evaluated fromrelation R, such as B(ρ,w(t)/(W−D+w(t))) for example. Noting the randomvariable X is an integer greater than or equal to zero and less than orequal to ρ, X copies of tuple t are output for sample S for step 916.Although illustrated as being distributed in accordance with thebinomial distribution B(ρ,w(t)/(W−D+w(t))) for random sampling, thevariable X may be determined in accordance with any suitableprobability, such as for biased sampling for example. The number ρ oftuples yet to be obtained for sample S is accordingly reduced by X forstep 918 and compared to zero for step 920.

[0105] If ρ is greater than zero for step 920, steps 910-920 arerepeated until all tuples of relation R have been evaluated asdetermined for step 910 or until the number ρ of tuples yet to beobtained for sample S has been reduced to zero as determined for step920. Flow diagram 900 then ends for step 922.

[0106] The sampling technique of FIG. 9 extends the unweighted samplingtechnique of FIG. 7 to weighted sampling. Indeed, the sampling techniqueof FIG. 9 may be used to implement the sampling technique of FIG. 7 byspecifying the same weight for each tuple t of relation R, such asw(t)=1 for example. The sampling technique of FIG. 9 generally sharesthe features of the sampling technique of FIG. 7, although the samplingtechnique of FIG. 9 requires the total weight W of the tuples ofrelation R. As one example, the total weight W may be determined indetermining the weight w(t) for each tuple t of relation R. The totalweight W of relation R may optionally be designated as a parameter for asample operator.

[0107] Database server 620 may also perform weighted sequential WRsampling in one pass by obtaining a tuple t from relation R and itsspecified weight w(t), selectively resetting one or more tuples of areservoir to be tuple t based on its weight w(t) and based on aprobability, and repeating this technique for other tuples of relation Rto form sample S. As one example, database server 620 may performweighted sequential WR sampling in accordance with a flow diagram 1000of FIG. 10.

[0108] For step 1002 of FIG. 10, a variable r is initialized to the sizeof sample relation S to be obtained from relation R. For step 1004, avariable D is initialized to zero. The variable D maintains a sum of theweights of the tuples that have been obtained from relation R forevaluation. For step 1006, a reservoir array A[1. . . r] of r dummytuples is initialized. The r tuples of the reservoir array are resetwith tuples from relation R to obtain sample S.

[0109] Each tuple of relation R is evaluated for steps 1008, 1010, 1012,1014, 1016, and 1018 to determine whether any one or more of the tuplesof the reservoir array are to be reset to that tuple from relation R.

[0110] For step 1008, a next tuple t in relation R is obtained with itsweight w(t) for evaluation, and the variable D is increased by w(t) forstep 1010. For step 1012, a variable j is initialized to one. Thevariable j is used to index each tuple in the reservoir array. For step1014, the jth tuple of the reservoir array is reset to be tuple t with aprobability based on weight w(t) of tuple t and the sum of the weight ofthe tuples that have been obtained from relation R for evaluation, suchas w(t)/D for example. The variable j is incremented by one for step1016 and compared to r for step 1018. If the variable j is less than orequal to r for step 1018, steps 1014-1018 are repeated so the othertuples of the reservoir array may be possibly reset to be tuple t. Whenthe variable j is greater than r, steps 1008-1018 are repeated until alltuples of relation R have been evaluated as determined for step 1008.The r tuples of the reservoir array are then output as sample S for step1020.

[0111] The sampling technique of FIG. 10 extends the unweighted samplingtechnique of FIG. 8 to weighted sampling. Indeed, the sampling techniqueof FIG. 10 may be used to implement the sampling technique of FIG. 8 byspecifying the same weight for each tuple t of relation R. such asw(t)=1 for example. The sampling technique of FIG. 10 generally sharesthe features of the sampling technique of FIG. 8.

[0112] WEIGHTED SEQUENTIAL WoR SAMPLING

[0113] A weighted without replacement (WoR) sample S of an f-fraction orr tuples from a relation R of n tuples with each tuple t of relation Rhaving a specified non-negative integer weight w(t) is assumed to be thesame as an unweighted WoR sample from a modification of relation R to arelation R^(w) containing w(t) copies of each tuple t of relation R.Database server 620 may perform weighted WoR sampling, for example, inperforming a sample operator such as SAMPLE(R, r, WoR, w(t)).

[0114] Database server 620 may perform weighted sequential WoR samplingin one pass by obtaining a tuple t from relation R and its specifiedweight w(t), selectively outputting one or more copies of tuple t basedon its weight w(t) and based on a probability, and repeating thistechnique for other tuples of relation R to form sample S. As oneexample, database server 620 may perform weighted sequential WoRsampling in accordance with a flow diagram 1100 of FIG. 11.

[0115] For step 1102 of FIG. 11, a variable r is initialized to the sizeof sample relation S to be obtained from relation R. For step 1104, avariable D is initialized to zero. The variable D maintains a sum of theweights of the tuples that have been obtained from relation R forevaluation. For step 1106, the variable ρ is initialized to r. Thevariable ρ maintains a count of the number of tuples remaining to beoutput for sample S. For step 1108, a variable W is initialized to thesum of the weights of all n tuples of relation R.

[0116] Each tuple of relation R is evaluated for steps 1110, 1112, 1114,1116, 1118, 1120, 1122, 1124, 1126, and 1128 to determine whether anyand, if so, how many copies of that tuple are to be output for sample S.

[0117] For step 1110, a next tuple t in relation R is obtained with itsweight w(t) for evaluation. A variable i is initialized to one for step1112. The variable i maintains an iteration count for evaluating tuple tw(t) times for output for sample S. The variable D is incremented forstep 1114. For step 1116, a binary random variable X is set to one witha probability based on the number ρ of tuples yet to be obtained forsample S and the number of possible tuples that may be output for sampleS, such as p/(W−D+1) for example. The variable X is compared to one forstep 1118. If the variable X is one, a copy of tuple t is output forsample S for step 1120. The number ρ of tuples yet to be obtained forsample S is accordingly reduced by X or one for step 1122 and comparedto zero for step 1124. If ρ is greater than zero for step 1124, thevariable i is then incremented for step 1126. If the variable X is zerofor step 1118, the variable i is incremented for step 1126 withoutoutput of tuple t for this ith iteration of steps 1114-1128.

[0118] The variable i is compared to weight w(t) for step 1128. If thevariable i is not greater than weight w(t), steps 1114-1128 are repeateduntil the variable i is greater than zero as determined for step 1128 oruntil the number ρ of tuples yet to be obtained for sample S has beenreduced to zero as determined for step 1124. If the variable i isgreater than weight w(t), steps 1110-1128 are repeated until all tuplesof relation R have been evaluated as determined for step 1110 or untilthe number ρ of tuples yet to be obtained for sample S has been reducedto zero as determined for step 1124. Flow diagram 1100 then ends forstep 1130.

[0119] Rather than iteratively evaluating each tuple t w(t) times forsteps 1114-1128, a suitable random generation technique that outputs Xnumber of copies of tuple t, where 0≦X≦w(t), may be used for otherexamples.

[0120] The sampling technique of FIG. 11 generally shares the featuresof the sampling technique of FIG. 9.

[0121] Database server 620 may also perform weighted sequential WoRsampling in one pass by obtaining a tuple t from relation R and itsspecified weight w(t), selectively resetting one or more tuples of areservoir to be tuple t based on its weight w(t) and based on aprobability, and repeating this technique for other tuples of relation Rto form sample S. As one example, database server 620 may performweighted sequential WoR sampling in accordance with a flow diagram 1200of FIG. 12.

[0122] For step 1202 of FIG. 12, a variable r is initialized to the sizeof sample relation S to be obtained from relation R. For step 1204, avariable D is initialized to zero. The variable D maintains a sum of theweights of the tuples that have been obtained from relation R forevaluation. For step 1206, a reservoir array A[1. . . r] of r dummytuples is initialized. The r tuples of the reservoir array are resetwith tuples from relation R to obtain sample S.

[0123] Each tuple of relation R is evaluated for steps 1208, 1210, 1212,1214, 1216, 1218, 1220, and 1222 to determine whether any one or more ofthe tuples of the reservoir array are to be reset to that tuple fromrelation R.

[0124] For step 1208, a next tuple t in relation R is obtained with itsweight w(t) for evaluation. A variable j is initialized to one for step1210. The variable j maintains an iteration count for evaluating tuple tw(t) times for output for sample S. The variable D is incremented forstep 1212. For step 1214, the variable D is compared to r. If thevariable D is less than or equal to r, the tuple A[D] of the reservoirarray is reset to be tuple t for step 1216. Otherwise, for step 1218, arandom tuple in the reservoir array is reset to be tuple t with aprobability based on the sample size r and the total number of times atuple from relation R has been evaluated for output for sample S, suchas r/D for example. Following step 1216 or 1218, the variable j isincremented for step 1220 and compared to w(t) for step 1222.

[0125] If the variable j is less than or equal to w(t) for step 1222,steps 1212-1222 are repeated so the other tuples of the reservoir arraymay be possibly reset to be tuple t. When the variable j is greater thanw(t), steps 1208-1222 are repeated until all tuples of relation R havebeen evaluated as determined for step 1208. The r tuples of thereservoir array are then output as sample S for step 1224.

[0126] Rather than iteratively evaluating each tuple t w(t) times forsteps 1212-1222, a suitable random generation technique that outputs Xnumber of copies of tuple t, where 0≦X≦w(t), may be used for otherexamples.

[0127] The sampling technique of FIG. 12 generally shares the featuresof the sampling technique of FIG. 10.

[0128] WEIGHTED SEQUENTIAL CF SAMPLING

[0129] A weighted independent coin flips (CF) sample S of an f-fractionor r tuples from a relation R of n tuples with each tuple t of relationR having a specified non-negative integer weight w(t) is assumed to bethe same as an unweighted CF sample from a modification of relation R toa relation RW containing w(t) copies of each tuple t of relation R.Database server 620 may perform weighted CF sampling, for example, inperforming a sample operator such as SAMPLE(R, r, CF, w(t)).

[0130] Database server 620 may perform weighted sequential CF samplingin one pass by obtaining a tuple t from relation R and its specifiedweight w(t), selectively outputting one or more copies of tuple t basedon its weight w(t) and based on a probability, and repeating thistechnique for other tuples of relation R to form sample S. As oneexample, database server 620 may perform weighted sequential CF samplingin accordance with a flow diagram 1300 of FIG. 13.

[0131] For step 1302 of FIG. 13, a variable f is set to the fraction oftuples to be sampled from relation R. For step 1304, a next tuple t inrelation R is obtained with its weight w(t) for evaluation. For step1306, a random variable X distributed as a binomial distribution isdetermined. An exemplary binomial distribution has as parameters theweight w(t) and a probability based on fraction f, such as B(w(t),f) forexample. Noting the random variable X is an integer greater than orequal to zero and less than or equal to w(t), X copies of tuple t areoutput for sample S for step 1308. Although illustrated as beingdistributed in accordance with the binomial distribution B(w(t),f) forrandom sampling, the random variable X may be determined in accordancewith any suitable probability, such as for biased sampling for example.Steps 1304-1308 are repeated until all tuples of relation R have beenevaluated as determined for step 1304. Flow diagram 1300 then ends forstep 1310.

[0132] Sampling tuples in this manner advantageously sequentiallyproduces the r tuples of sample S in the same relative order as inrelation R and uses minimal auxiliary memory as only one tuple ofrelation R is stored at a time. Sampling tuples in this manner also doesnot require the size n or total weight W of the relation R and maytherefore be used to sample tuples from intermediate relations, forexample, of unknown size.

[0133] SAMPLING CONVERSIONS

[0134] Database server 620 may also perform sampling for one type ofsemantics using a sampling technique of different semantics.

[0135] As one example, database server 620 can convert a WR samplingtechnique to a WoR sampling technique by checking each newly sampledtuple and rejecting that tuple if it has already been generated.

[0136] As another example, database server 620 can convert a CF samplingtechnique to a WoR sampling technique by sampling a slightly largerfraction f′ to ensure that at least f-fraction tuples are obtained andrejecting an appropriate number of tuples such that f-fraction of tuplesare obtained. The f-fraction of tuples may also be obtained from thef′-fraction of tuples by obtaining a WoR sample of f-fraction tuplesfrom the CF sample of f′-fraction tuples.

[0137] As yet another example, database server 620 can convert a WoRsampling technique to a WR sampling technique by sampling withreplacement from the WoR sample.

[0138] SAMPLING OVER JOINS

[0139] Database server 620 supports sampling over a join of tworelations R₁ and R₂ without requiring the computation of the full joinJ=R₁

R₂ and without requiring the materialization of both relations R₁ and R₂and/or indexes on the join attribute values of both relations R₁ and R₂.

[0140] Referring to the previous example where:

R₁(A,B)={(a₁,b₀), (a₂,b₁), (a₂,b₂), (a₂,b₃), . . . , (a₂,b_(n))}

[0141] and

R₂(A,C)={(a₂,c₀), (a₁,c₁), (a₁,c₂), (a₁,c₃), . . . , (a₁,c₁)},

[0142] the projection, after removal of duplicates, of the relation J=R₁

R₂ onto attributes A and B, for example, does not yield a uniform randomsample of relation R₁ but rather gives a biased or weighted sample ofrelation R₁ where each tuple of relation R₁ is sampled with aprobability dependent on the number of tuples in relation R₂ joiningwith relation R₁. Specifically, the tuple (a₁,b₀) of relation R₁ issampled with probability ½ while the remaining tuples are each sampledwith probability ½n. Similarly, the tuple (a₂,c₀) of relation R₂ issampled with probability ½ while the remaining tuples are each sampledwith probability ½n. The skewness of the resulting distribution stemsfrom the removal of duplicates. Also, the relatively high skew inrelations R₁ and R₂ prevents samples of relations R₁ and R₂ fromreflecting frequently appearing attribute values in the join outputadequately.

[0143] Although a join of random samples of operand relations R₁ and R₂will not likely give a random sample of the join of operand relations R₁and R₂, a sample of the join of relations R₁ and R₂ may be obtained fromnon-uniform samples of relations R₁ and R₂.

[0144] Considering a tuple t=(a₁,b₀)∈R₁ and its influence on relationJ=R₁

R₂, the number of tuples in relation R₁ containing the value a₁ inattribute A is one, that is m₁(a₁)=1, yet the number J_(t)(R₂) of tuplesin relation R₂ joining with tuple t is m₂(a₁). Although a random sampleof relation R₁ will unlikely have the one tuple with the A-value a₁, thefraction of tuples in relation J having A-value a₁ is m₂(a₁)/|J| whichcould be relatively large. The structure of relation J is more likelyreflected in a sample S₁ of relation R₁ where a tuple t from relation R₁is sampled with a probability based on the number m₂(t.A) of matchingjoin attribute values t.A in relation R₂.

[0145] Database server 620 may therefore obtain a sample S of the joinof relations R₁ and R₂ by sampling tuples from relation R₁ based on thenumber of matching join attribute values in relation R₂ and joiningtuples of the resulting sample S₁ with those of relation R₂. Databaseserver 620 may use frequency statistics on the join attribute values ofrelation R₂ in sampling tuples of relation R₁. Because relation R₂ mayor may not be materialized and/or indexed, how database server 620obtains a sample of the join of relations R₁ and R₂ depends on theavailability of information for relations R₁ and/or R₂ such as in thefollowing cases:

[0146] Case A: no information is available for either relation R₁ or R₂;

[0147] Case B: no information is available for relation R₁ but indexesand/or statistics are available for relation R₂; or

[0148] Case C: indexes and/or statistics are available for relations R₁and R₂.

[0149] Any sampling technique for a case with less available informationwill also apply for a case with more available information. When noinformation is available for a relation, database server 620 may assumethe relation is not materialized but rather is produced as a stream by apipeline process.

[0150] For Case A, database 620 has no frequency statistics or indexesfor relation R₂ to help guide any sampling of relation R₁. Databaseserver 620 may therefore use the sampling technique of FIG. 2 to computethe full join J=R₁

R₂ and obtain a sample relation S of r tuples from relation J byrejection sampling each tuple of relation J with probability 1/|J|.

[0151] Although the prior art sampling technique of FIG. 4 avoids havingto compute the full join J=R₁

R₂, in practice it requires the materialization of relation R₁ and anindex for random access for relation R₁ for step 406, full statistics onthe join attribute values of relation R₂ for step 410, and thematerialization of relation R₂ and an index for random access ofrelation R₂ for step 408. The sampling technique of FIG. 4 thereforeapplies only to Case C of the above three cases.

[0152] For Case B, however, database server 620 may avoid having tocompute the full join J=R₁

R₂ as well as avoid the requirement for the materialization of bothrelations R₁ and R₂ and indexes on the join attribute values of bothrelations R₁ and R₂ because database server 620 may obtain a samplerelation S of r tuples of the join of relations R₁ and R₂ usingfrequency statistics and/or an index on the join attribute values ofrelation R₂ in sampling tuples of relation R₁.

[0153] JOIN SAMPLE OPERATOR

[0154] Database server 620 supports a join sample operator that may beinserted into a query execution plan or tree to obtain a sample S from ajoin of two relations R₁ and R₂ using designated sampling semantics.

[0155] The join sample operator may have as parameters relations R₁ andR₂, the size of sample S, and the sampling semantics to be used inobtaining sample S. One exemplary join sample operator isJOINSAMPLE(relation R₁, relation R₂, sample size, semantics). The samplesize and sampling semantics parameters may optionally be set orpredetermined by the join sampling technique performed by databaseserver 620 in performing the join sample operator.

[0156] The sample size parameter may be expressed as a number r oftuples to be obtained from the resulting relation J=R₁

R₂ for sample S or as a fraction f of the n tuples of relation J to beobtained for sample S. Whether the number r or fraction f is to bedesignated may depend on the join sampling technique performed bydatabase server 620 in performing the join sample operator. Where thenumber n of tuples of relation J is known or may be determined, however,database server 620 may determine the fraction f for a designated numberr and may also determine the number r for a designated fraction f.

[0157] The sampling semantics parameter may identify any suitablesampling semantics such as, for example, with replacement (WR), withoutreplacement (WoR), or independent coin flips (CF) semantics.

[0158] Database server 620 may perform any suitable join samplingtechnique in performing the join sample operator. Database server 620may perform the join sampling technique of FIG. 2, for example, usingfor step 204 the sampling semantics as designated by a join sampleoperator or as predetermined by database server 620. Database server 620may avoid materializing relation J for step 204 by performing asequential sample of relation J using any suitable sequential samplingtechnique, such as the sampling technique of FIG. 7 or 8 for WRsemantics for example. Database server 620 may also perform the joinsampling technique of FIG. 4, for example, using for step 406 thesampling semantics as designated by a join sample operator or aspredetermined by database server 620. To avoid having to compute thefull join J=R₁

R₂ as for the join sampling technique of FIG. 2 and to avoid therequirement for the materialization of and indexes on both relations R₁and R₂ as for the join sampling technique of FIG. 4, database server 620may perform the join sampling techniques of FIG. 14, 15, 16, or 17.

[0159] STREAM SAMPLING

[0160] Database server 620 for one example may perform a stream samplingtechnique over joins in accordance with a flow diagram 1400 of FIG. 14.

[0161] For step 1402 of FIG. 14, a variable r is initialized to the sizeof sample relation S to be obtained from the join of relations R₁ andR₂. For step 1404, a weighted sample relation S₁ of r tuples s₁, . . . ,s_(r) is obtained from relation R₁ based on the number m₂(t.A) ofmatching join attribute values t.A in relation R₂ for each tuple t ofrelation R₁. Any suitable weighted sampling technique using the samplingsemantics as designated by a join sample operator or as predetermined bydatabase server 620 may be used to obtain sample relation S₁.

[0162] For WR semantics, the weighted sampling technique of FIG. 9 or10, for example, may be used with each tuple t of relation R₁ having aspecified weight proportional to the number of tuples in relation R₂having a join attribute value that matches that of tuple t. That is, theweight w(t) for each tuple t∈R₁ is set to m₂(t.A). For WoR semantics,the weighted sampling technique of FIG. 11 or 12, for example, may beused with this same weight specification.

[0163] For CF semantics, the weighted sampling technique of FIG. 13, forexample, may be used with this same weight specification, noting thatthe sampling fraction f for the sampling technique of FIG. 13 may bedetermined from the sample size r of step 1402 and that the sample sizer may be redefined for step 1404 as the sampling technique of FIG. 13may not return a sample relation S₁ of the original sample size r.Alternatively, the sampling fraction f may be initialized for step 1402,and the sample size r may be determined based on the number of tuplesthe sampling technique of FIG. 13 returns for sample relation S₁.

[0164] For step 1406, a variable i is initialized to one. The variable iis used to denote the ith tuple S₁ of sample relation S₁. For step 1408,a tuple t₂ is randomly sampled from among all tuples of relation R₂having a matching join attribute value with tuple s₁ of sample relationS₁, that is t₂.A=s₁.A. For step 1410, a tuple T is computed as the joinT=s₁

t₂ and output for sample relation S. The variable i is incremented forstep 1412 and compared to r for step 1414. If i is less than or equal tor, steps 1408-1414 are repeated for the remaining tuples s_(i) of samplerelation S₁. If i is greater than r, indicating all tuples of samplerelation S₁ have been joined with a tuple from relation R₂, flow diagram1400 ends for step 1416.

[0165] In practice, the stream sampling technique of FIG. 14 may be usedwhere no information is available for relation R₁ and may assume forstep 1404 the availability of frequency statistics on the join attributevalues for relation R₂ and for step 1408 an index for random access ofrelation R₂. Whether relation R₁ is materialized or produced as a streamfrom a pipeline process, the sampling technique of FIG. 14 maysequentially sample tuples from relation R₁ for step 1404 in one pass.As each tuple s₁ of sample relation S₁ is sequentially output, then,steps 1408 and 1410 may be performed to produce and output each tuple Tfor sample relation S sequentially.

[0166] GROUP SAMPLING

[0167] Database server 620 for another example may perform a groupsampling technique over joins in accordance with a flow diagram 1500 ofFIG. 15.

[0168] For step 1502 of FIG. 15, a variable r is initialized to the sizeof sample relation S to be obtained from the join of relations R₁ andR₂. For step 1504, a weighted sample relation S₁ of r tuples s₁, . . .,s_(r) is obtained from relation R₁ based on the number m₂(t.A) ofmatching join attribute values t.A in relation R₂ for each tuple t ofrelation R₁. Any suitable weighted sampling technique using the samplingsemantics as designated by a join sample operator or as predetermined bydatabase server 620 may be used to obtain sample relation S₁.

[0169] For WR semantics, the weighted sampling technique of FIG. 9 or10, for example, may be used with each tuple t of relation R₁ having aspecified weight proportional to the number of tuples in relation R₂having a join attribute value that matches that of tuple t. That is, theweight w(t) for each tuple t∈R₁ is set to m₂(t.A). For WoR semantics,the weighted sampling technique of FIG. 11 or 12, for example, may beused with this same weight specification.

[0170] For CF semantics, the weighted sampling technique of FIG. 13, forexample, may be used with this same weight specification, noting thatthe sampling fraction f for the sampling technique of FIG. 13 may bedetermined from the sample size r of step 1502 and that the sample sizer may be redefined for step 1504 as the sampling technique of FIG. 13may not return a sample relation S₁ of the original sample size r.Alternatively, the sampling fraction f may be initialized for step 1502,and the sample size r may be determined based on the number of tuplesthe sampling technique of FIG. 13 returns for sample relation S₁.

[0171] For step 1506, a relation J is computed as the join J=S₁

R₂. The relation J has tuples grouped by the tuples s₁ , . . . , s_(r)of sample relation S₁. For step 1508, one tuple is randomly sampled fromeach group of tuples of relation J corresponding to each tuple s₁ , . .. , s_(r) to produce sample relation S. Any suitable sampling technique,such as the sampling technique of FIG. 7 or 8 for example, may be usedto sample a tuple from the group corresponding to each tuple s_(i). Forthe sampling technique of FIG. 7, the size of each group is m₂(s₁.A).

[0172] The group sampling technique of FIG. 15 may be used where noinformation is available for relation R₁ and where frequency statisticson the join attribute values yet no index is available for relation R₂.Whether relation R₁ is materialized or produced as a stream from apipeline process, the sampling technique of FIG. 15 may sequentiallysample tuples from relation R₁ for step 1504 in one pass.

[0173] COUNT SAMPLING

[0174] Database server 620 for another example may perform a countsampling technique over joins in accordance with a flow diagram 1600 ofFIG. 16.

[0175] For step 1602 of FIG. 16, a variable r is initialized to the sizeof sample relation S to be obtained from the join of relations R₁ andR₂. For step 1604, a weighted sample relation S₁ of r tuples s₁, . . . ,s_(r) is obtained from relation R₁ based on the number m₂(t.A) ofmatching join attribute values t.A in relation R₂ for each tuple t ofrelation R₁. Any suitable weighted sampling technique using the samplingsemantics as designated by a join sample operator or as predetermined bydatabase server 620 may be used to obtain sample relation S₁.

[0176] For WR semantics, the weighted sampling technique of FIG. 9 or10, for example, may be used with each tuple t of relation R₁ having aspecified weight proportional to the number of tuples in relation R₂having a join attribute value that matches that of tuple t. That is, theweight w(t) for each tuple t∈R₁ is set to m₂(t.A). For WoR semantics,the weighted sampling technique of FIG. 11 or 12, for example, may beused with this same weight specification.

[0177] For CF semantics, the weighted sampling technique of FIG. 13, forexample, may be used with this same weight specification, noting thatthe sampling fraction f for the sampling technique of FIG. 13 may bedetermined from the sample size r of step 1602 and that the sample sizer may be redefined for step 1604 as the sampling technique of FIG. 13may not return a sample relation S₁ of the original sample size r.Alternatively, the sampling fraction f may be initialized for step 1602,and the sample size r may be determined based on the number of tuplesthe sampling technique of FIG. 13 returns for sample relation S₁.

[0178] For step 1606, sample relation S₁ is materialized and the numberu₁(v) of join attribute values v in sample relation S₁ is determined forall values v of domain D of join attribute A. For step 1608, r tuplesare sampled from relation R₂ to produce a sample relation S₂ such thatthe number u₂(v) of tuples in relation S₂ with join attribute value v isthe same as that in relation S₁, that is u₂(v)=u₁(v). Any suitablesampling technique may be used for step 1608. As one example, anunweighted sampling technique using the sampling semantics as designatedby a join sample operator or as predetermined by database server 620 maybe used to obtain sample relation S₂. For WR semantics, step 1608 forone example may be performed by scanning relation R₂ and separatelyperforming, for each join attribute value v, the sampling technique ofFIG. 7 on the tuple(s) of relation R₂ having the join attribute value v,noting the size of each relation being sampled is m₂(v) and the size ofeach sample to be obtained is u₁(v) for the sampling technique of FIG.7. For each join attribute value v, then, a sample of u₁(v) tuples areobtained with replacement from the m₂(v) tuples of relation R₂.

[0179] For each tuple of sample relation S₂, for step 1610, a tuplehaving the same join attribute value from sample relation S₁ is randomlysampled without replacement and joined with the tuple from samplerelation S₂ to produce a tuple for sample relation S. Sample relation S₂may optionally be materialized and randomly matched to sample relationS₁ for step 1610.

[0180] The count sampling technique of FIG. 16 may be used where noinformation is available for relation R₁ and where frequency statisticson the join attribute values yet no index is available for relation R₂.Whether relation R₁ is materialized or produced as a stream from apipeline process, the sampling technique of FIG. 16 may sequentiallysample tuples from relation R₁ for step 1604 in one pass. The countsampling technique of FIG. 16 also does not require an index on relationR₂ as relation R₂ may be scanned only once for step 1608.

[0181] FREQUENCY PARTITION SAMPLING

[0182] As shown by the sampling techniques of FIGS. 14-16, havingcomplete frequency statistics on relation R₂ can be advantageous.Complete frequency statistics on relation R₂, however, may not beavailable. Because the availability of a histogram having frequencystatistics on high frequency values in relation R₂ may be reasonablyassumed and because a significant source of inefficiency in computing afull join of two relations is the presence of join attribute valueshaving a relatively high frequency, as the size of the join becomeslarge because of the high frequency values, database server 620 mayperform a frequency partition sampling technique over joins by computinga full join of the tuples having low frequency join attribute values inrelations R₁ and R₂ and obtaining a sample from this join, as in thejoin sampling technique of FIG. 2, and by using frequency statistics onhigh frequency join attribute values in relation R₂ to obtain a sampleof the join of tuples having high frequency join attribute values inrelations R₁ and R₂, as in the join sampling techniques of FIGS. 14, 15,and 16.

[0183] The join attribute values need not be of high frequencysimultaneously in both relations R₁ and R₂. Rather, the join attributevalues which have a high frequency in relation R₂ may be used as asuitable approximation to the set of join attribute values which createa large number of tuples in the join.

[0184] Database server 620 for one example may perform a frequencypartition sampling technique over joins in accordance with a flowdiagram 1700 of FIG. 17.

[0185] For step 1702 of FIG. 17, a variable r is initialized to the sizeof sample relation S to be obtained from the join of relations R₁ andR₂. For step 1704, a partition parameter p is determined for domain D ofjoin attribute A to separate high frequency join attribute values fromlow frequency join attribute values in relations R₁ and R₂. Partitionparameter p may be determined in any suitable manner. As one example,partition parameter p may be determined based on a suitable thresholdpercentage of the number of tuples of relation R₂, such as 2%, orp=0.02*n₂, for example. The suitability of a particular partitionparameter p may depend, for example, on the data distribution skew ofthe join attribute values in relation R₂. A suitable partition parameterfor a given pair of relations R₁ and R₂ may be determinedexperimentally, for example, by using different partition parameters forthe join sampling technique of FIG. 17 and comparing the resultingsamples with a sample produced from the join sampling technique of FIG.2.

[0186] For step 1706, domains D^(hi) and D^(lo) are defined based onparameter p. The domain D^(hi) may be defined as the set of values indomain D having a frequency greater than or equal to partition parameterp in relation R₂, and the domain D^(lo) may be defined as the remainingvalues in domain D, that is those values in domain D having a frequencyless than parameter p in relation R₂. Domains D^(hi) and D^(lo) are usedto induce a corresponding partition of relation R₁ into relations R₁^(hi)=R₁|_(D) ^(hi) and R₁ ^(lo)=R₁|_(D) ^(lo) and of relation R₂ intorelations R₂ ^(hi)=R₂|_(D) ^(hi) and R₂ ^(lo)=R₂|_(D) ^(lo), where R|_(D′)denotes the selection of tuples from relation R with a joinattribute value in domain D′.

[0187] For step 1708, a relation J^(lo) is computed as the joinJ^(lo).=R₁ ^(lo)

R₂ ^(lo), and a sample relation J^(*hi) is computed as a sample of thejoin J^(hi)=R₁ ^(hi)

R₂ ^(hi) using the sampling semantics as designated by a join sampleoperator or as predetermined by database server 620. Sample relationJ^(*hi) may be computed, for example, in accordance with the streamsampling technique of FIG. 14, the group sampling technique of FIG. 15,or the count sampling technique of FIG. 16.

[0188] For step 1710, a suitable number, such as the sample size r forexample, of tuples from relation J^(lo) are sampled to produce a samplerelation J^(*lo). Sample relation J^(*lo) may be obtained from relationJ^(lo) using any suitable sampling technique such as the samplingtechnique of FIG. 7 or 8, for example. The computation of the joinJ^(lo)=R₁ ^(lo)

R₂ ^(lo) for step 1708 and the sampling of relation J^(lo) for step 1710generally corresponds to the join sampling technique of FIG. 2.

[0189] A sample relation S of the join J=R₁

R₂ is then determined for step 1712 based on sample relations J^(*lo)and J^(*hi). Sample relation S may be determined in any suitable mannerbased on sample relations J^(lo) and J^(*hi). As one example, samplerelation S may be determined by sampling a number of tuples from eachsample relation J^(*lo) and J^(*hi) in proportion to the relative sizeof the join in each domain D^(lo) and D^(hi), respectively. Obtainingsample relation S in this manner may be facilitated by sampling the samenumber r of tuples for each sample relation J^(*lo) and J^(*hi).

[0190] Database server 620 for one example may perform the frequencypartition sampling technique of FIG. 17 in accordance with the blockdiagram of FIG. 18.

[0191] As illustrated in FIG. 18, relation R₁ is received as a streamand, for steps 1704 and 1706 of FIG. 17, is partitioned into relationsR₁ ^(hi) and R₁ ^(lo) by evaluating each tuple of relation R₁ as it isreceived and directing the tuple either through a weighted sampling tool1802 if the tuple has a join attribute value that exceeds parameter p orpast weighted sampling tool 1802 if the tuple has a join attribute valuethat does not exceed parameter p. As the tuples of relation R₁ ^(hi)stream by, frequency statistics on the join attribute values of relationR₁ ^(hi) may be collected and combined with the frequency statistics onthe join attribute values of relation R₂ ^(hi) to determine the net sizen_(hi) of the join J^(hi)=R₁ ^(hi)

R₂ ^(hi).

[0192] For step 1708, weighted sampling tool 1802 uses frequencystatistics on the join attribute values of relation R₂ as weights toobtain a sample relation S₁ of r tuples from relation R₁ ^(hi). Weightedsampling tool 1802 may perform any suitable sequential samplingtechnique such as that of FIG. 9 or 10, for example, for WR semantics.The tuples of sample relation S₁ are merged back into the stream withthe tuples of relation R₁ ^(lo), and the resulting merged output streamis denoted as a relation R₁ ^(*)=S₁∪R₁ ^(lo). The tuples of relation R₁^(*) are directed through a join tool 1804 with the tuples of relationR₂ to compute a relation J^(*) as the join J^(*)=R₁ ^(*)

R₂. As the tuples of relation J^(*) are output, the number n_(lo) oftuples in relation J^(*) that contain D^(lo) values, that is the numbern_(lo) of tuples in the join J^(lo)=R₁ ^(lo)

R₂ ^(lo), may be determined.

[0193] Relation J^(*) is then partitioned with the tuples having highfrequency join attribute values directed through a group sampling tool1806 where one random tuple is sampled from each of the r groupscorresponding to the r tuples of sample relation S₁ to produce samplerelation J^(*hi). Weighted sampling tool 1802, join tool 1804, and groupsampling tool 1806 thus compute sample relation J^(*hi) as a sample ofthe join J^(hi)=R₁ ^(hi)

R₂ ^(hi) using WR sampling semantics in accordance with the groupsampling technique of FIG. 15.

[0194] The tuples of relation J^(*) having low frequency join attributevalues, that is the tuples of relation J^(lo), are directed through asampling tool 1808 where r tuples from relation J^(lo) are sampled toproduce sample relation J^(*lo) for step 1710. Sampling tool 1808 mayperform any suitable sampling technique, such as the sampling techniqueof FIG. 7 or 8 for example. Join tool 1804 and sampling tool 1808 thuscompute sample relation J^(*lo) as a sample of the join J^(lo)=R₁ ^(lo)

R₂ ^(lo) generally in accordance with the join sampling technique ofFIG. 2.

[0195] For step 1712, a number of tuples to be sampled from each samplerelation J^(*hi) and J^(*lo) is determined using a coin flip techniquewhere r coins are flipped with heads probability proportional to n_(hi)and tails probability proportional to n_(lo). The number of heads isr_(h), and the number of tails is r_(t), noting r=r_(h)+r_(t). Thetuples of sample relation J^(*hi) are directed through a sampling tool1810 where r_(h) tuples of sample relation J^(*hi) are sampled, and thetuples of sample relation J^(*lo) are directed through a sampling tool1812 where r_(t) tuples of sample relation J^(lo) are sampled. Samplingtools 1810 and 1812 may each perform any suitable sampling technique,such as an unweighted sequential WoR sampling technique for example. Thesample r_(h) tuples of sample relation J^(*hi) and the sample r_(t)tuples of sample relation J^(*lo) are then combined to form samplerelation S of r tuples.

[0196] The frequency partition sampling technique example of FIG. 18requires neither an index nor complete frequency statistics on the joinattribute values of relation R₂. Rather, a histogram having frequencystatistics for the high frequency values in relation R₂ is used. If anindex on relation R₂ ^(hi) is available in addition to the frequencystatistics on relation R₂ ^(hi), then sample relation J^(*hi) may becomputed for step 1708 using the stream sampling technique of FIG. 14.Alternatively, the use of an index on relation R₂ may be replaced with ascan of relation R₂ ^(hi) by computing sample relation J^(*hi) for step1708 using the count sampling technique of FIG. 16.

[0197] Database server 620 for another example may perform the frequencypartition sampling technique of FIG. 17 using CF semantics by computingsample relation J^(*hi) for step 1708 using CF semantics in accordancewith the group sampling technique of FIG. 15. The tuples of relationJ^(lo) may be sampled for step 1710 using a suitable unweighted CFsampling technique to produce sample relation J^(*lo), and samplerelations J^(*hi) and J^(*lo) may then be combined for step 1712 toproduce sample relation S.

[0198] JOIN SAMPLING IMPLEMENTATION EXAMPLES

[0199] As examples, the sequential WR sampling techniques of FIGS. 7 and9 may each be implemented with the Microsoft® SQL Server as an operator.Because of the object-oriented design of code in the Microsoft® SQLServer, adding each operator to a query execution tree may be done bycreating a derived class of a base operator class and implementingstandard methods, such as Open, Close, and GetRow for example.

[0200] The join sampling techniques of FIGS. 2, 4, 14, and 18 may beimplemented with the Microsoft® SQL Server in the following exemplarymanners.

[0201] For the join sampling technique of FIG. 2, an execution treegenerated by the optimizer of the Microsoft® SQL Server may be modifiedby adding the operator for the sequential WR sampling technique of FIG.7 as the root of the execution tree. The resulting query execution treeproduces a random sample of the query.

[0202] For the join sampling technique of FIG. 4, a random sample oftuples of relation R₁ may be obtained by creating a uniform randomsample of the key values of relation R₁, storing them in a temporarytable T₁, and joining table T₁ with relation R₁ on the key value. Also,each join method in the Microsoft® SQL Server, such as the Nested Loops,Hash Join, and Merge Join for example, may be modified so that eachsampled tuple from relation R₁ is joined with a random tuple fromrelation R₂ among all tuples in relation R₂ having a matching joinattribute value with the sampled tuple from relation R₁. For the NestedLoops join method, for example, a random number k between 1 and f(v) forthe join attribute value v in a sampled tuple from relation R₁ isdetermined. The first k-1 matches for the join are skipped beforeconsidering the kth match. This kth tuple may be accepted withprobability m₂(v)/M. The other sampled tuples from relation R₁ aresimilarly evaluated.

[0203] For the stream sampling technique of FIG. 14, the operator forthe sequential WR sampling technique of FIG. 9 may be implemented as achild of the join method, that is between scan operator on relation R₁and the join operator. The operator may determine the number of copiesto be generated and produce that many copies for each tuple of relationR₁ in the GetRow method. The statistics on the join attribute values ofrelation R₂ may be read from a file and stored in a work table indexedon the join attribute values to facilitate access of the frequency of agiven join attribute value. Also, the join operator may be modified sothat for each tuple sampled from relation R₁, exactly one tuple isoutput at random from among all the tuples that join with relation R₂.

[0204] For the frequency partition sampling technique of FIG. 18, theoperator for the stream sampling technique may be modified to generatethe weighted random sample from only the tuples having high frequencyjoin attribute values in relation R₁. The tuples of relation R₁ havinglow frequency join attribute values may pass through the operator. Thestatistics for high frequency join attribute values used by thisoperator may be read from a file and stored in a work table as for thestream sampling technique. Also, the join operator may be modified toperform the group sampling technique to produce the sample J^(*hi). Theoperator for the sequential WR sampling technique of FIG. 9 may be addedover the join operator to produce the sample J^(*lo). An operator formaterializing the samples J^(*hi) and J^(*lo) and for sampling themwithout replacement may be added as the root of the query tree toproduce the sample of the join.

[0205] In the foregoing description, the invention has been describedwith reference to specific exemplary embodiments thereof. It will,however, be evident that various modifications and changes may be madethereto without departing from the broader spirit or scope of thepresent invention as defined in the appended claims. The specificationand drawings are, accordingly, to be regarded in an illustrative ratherthan a restrictive sense.

What is claimed is:
 1. A sample operator for obtaining a sample of aplurality of records in a database system, the sample operator havingthe plurality of records and sampling semantics as parameters.
 2. Thesample operator of claim 1, wherein the sampling semantics may be withreplacement, without replacement, or coin flip sampling semantics. 3.The sample operator of claim 1 having a size of the sample as aparameter.
 4. The sample operator of claim 1 having a weight function asa parameter to specify a sampling weight for each record.
 5. The sampleoperator of claim 1 for obtaining the sample of the plurality of recordsin one pass by: (a) obtaining one record from the plurality of records;(b) selectively outputting the one record obtained in step (a) one ormore times based on a probability; and (c) repeating steps (a) and (b)for one or more other records of the plurality of records to form thesample of the plurality of records.
 6. The sample operator of claim 1for obtaining the sample of the plurality of records in one pass by: (a)obtaining one record from the plurality of records; (b) selectivelyresetting one or more records of a reservoir to be the one recordobtained in step (a) based on a probability; and (c) repeating steps (a)and (b) for other records of the plurality of records such that therecords of the reservoir form the sample of the plurality of records. 7.A sample operator for obtaining a sample of a plurality of records in adatabase system, the sample operator having the plurality of records asa parameter and having a weight function as a parameter to specify asampling weight for each record.
 8. The sample operator of claim 7having a size of the sample as a parameter.
 9. The sample operator ofclaim 7 for obtaining the sample of the plurality of records in one passby: (a) obtaining one record from the plurality of records and theweight specified for the one record; (b) selectively outputting the onerecord obtained in step (a) one or more times based on the weightspecified for the one record; and (c) repeating steps (a) and (b) forone or more other records of the plurality of records to form the sampleof the plurality of records.
 10. The sample operator of claim 7 forobtaining the sample of the plurality of records in one pass by: (a)obtaining one record from the plurality of records and the weightspecified for the one record; (b) selectively resetting one or morerecords of a reservoir to be the one record obtained in step (a) basedon the weight specified for the one record; and (c) repeating steps (a)and (b) for other records of the plurality of records such that therecords of the reservoir form the sample of the plurality of records.11. A method for obtaining a sample from a plurality of records in adatabase system, the method comprising the steps of: (a) identifying theplurality of records and sampling semantics from parameters of a sampleoperator; and (b) obtaining a sample from the identified plurality ofrecords using the identified sampling semantics.
 12. The method of claim11, wherein the identified sampling semantics is with replacement,without replacement, or coin flip sampling semantics.
 13. The method ofclaim 11, wherein the identifying step (a) comprises the step ofidentifying a size of the sample to be obtained from a parameter of thesample operator; and wherein the obtaining step (b) comprises the stepof obtaining the sample from the identified plurality of records basedon the identified sample size.
 14. The method of claim 11, wherein theidentifying step (a) comprises the step of identifying a weight functionfrom a parameter of the sample operator to specify a weight for eachrecord; and wherein the obtaining step (b) comprises the step ofobtaining the sample from the identified plurality of records based onthe specified weight of each record.
 15. The method of claim 11, whereinthe obtaining step (b) comprises the steps of: (i) obtaining one recordfrom the plurality of records, (ii) selectively outputting the onerecord obtained in step (b)(i) one or more times based on a probability,and (iii) repeating steps (b)(i) and (b)(ii) for one or more otherrecords of the plurality of records to obtain the sample.
 16. The methodof claim 11, wherein the obtaining step (b) comprises the steps of: (i)obtaining one record from the plurality of records, (ii) selectivelyresetting one or more records of a reservoir to be the one recordobtained in step (b)(i) based on a probability, and (iii) repeatingsteps (b)(i) and (b)(ii) for other records of the plurality of recordssuch that the records of the reservoir form the sample.
 17. A computerreadable medium having computer-executable instructions for performingthe steps of claim
 11. 18. A method for obtaining a sample from aplurality of records in a database system, the method comprising thesteps of: (a) identifying the plurality of records and a weight functionfrom parameters of a sample operator, wherein the weight functionspecifies a weight for each record; and (b) obtaining a sample from theidentified plurality of records based on the specified weight of eachrecord.
 19. The method of claim 18, wherein the identifying step (a)comprises the step of identifying a size of the sample to be obtainedfrom a parameter of the sample operator; and wherein the obtaining step(b) comprises the step of obtaining the sample from the identifiedplurality of records based on the identified sample size.
 20. The methodof claim 18, wherein the obtaining step (b) comprises the steps of: (i)obtaining one record from the plurality of records and the weightspecified for the one record, (ii) selectively outputting the one recordobtained in step (b)(i) one or more times based on the weight specifiedfor the one record, and (iii) repeating steps (b)(i) and (b)(ii) for oneor more other records of the plurality of records to obtain the sample.21. The method of claim 18, wherein the obtaining step (b) comprises thesteps of: (i) obtaining one record from the plurality of records and theweight specified for the one record, (ii) selectively resetting one ormore records of a reservoir to be the one record obtained in step (b)(i)based on the weight specified for the one record, and (iii) repeatingsteps (b)(i) and (b)(ii) for other records of the plurality of recordssuch that the records of the reservoir form the sample.
 22. A computerreadable medium having computer-executable instructions for performingthe steps of claim
 18. 23. A method for performing a sequential samplingof records in one pass in a database system, the method comprising thesteps of: (a) obtaining one record from a plurality of records; (b)selectively outputting the one record obtained in step (a) one or moretimes based on a probability; and (c) repeating steps (a) and (b) forone or more other records of the plurality of records to form a sampleof the plurality of records, wherein at least one record obtained instep (a) may be output more than one time for step (b).
 24. The methodof claim 23, wherein the plurality of records are a relation produced asa stream of records as a result of a query.
 25. The method of claim 23,wherein the plurality of records are materialized as a base relation ina database of the database system.
 26. The method of claim 23, whereinthe selectively outputting step (b) comprises the steps of: (i)determining a random number based on the probability such that therandom number is greater than or equal to zero, and (ii) outputting theone record the determined random number of times.
 27. The method ofclaim 26, wherein the determining step (b)(i) comprises the step ofdetermining the random number from a binomial distribution based on theprobability.
 28. The method of claim 26, wherein the determining step(b)(i) comprises the step of determining the random number such that therandom number is less than or equal to a number of record(s) remainingto be output for the sample.
 29. The method of claim 26, wherein thedetermining step (b)(i) comprises the step of determining the randomnumber based on a probability based on a number of record(s) of theplurality of records to be evaluated for output for step (b).
 30. Themethod of claim 23, wherein the selectively outputting step (b)comprises the step of selectively outputting the one record one or moretimes based on a weight specified for the one record.
 31. The method ofclaim 26, wherein the determining step (b)(i) comprises the step ofdetermining the random number based on a probability based on a weightof the one record divided by a sum of weight(s) of record(s) of theplurality of records to be evaluated for output for step (b).
 32. Themethod of claim 23, wherein the selectively outputting step (b)comprises the step of selectively outputting the one record based on aprobability a number of time(s) equal in number to the weight of the onerecord.
 33. The method of claim 32, wherein the selectively outputtingstep comprises the step of outputting the one record with a probabilitybased on a number of record(s) remaining to be output for the sampledivided by a number of possible record(s) that may be output for step(b).
 34. The method of claim 26, wherein the determining step (b)(i)comprises the step of determining the random number such that the randomnumber is less than or equal to a weight of the one record.
 35. Themethod of claim 26, wherein the determining step (b)(i) comprises thestep of determining the random number based on a probability based on afraction of the plurality of records.
 36. The method of claim 23,wherein the plurality of records form a relation and wherein the sampleis to be joined with records of another relation.
 37. A method forperforming a sequential sampling of records in one pass in a databasesystem, the method comprising the steps of: (a) obtaining one recordfrom a plurality of records; (b) selectively resetting one or morerecords of a reservoir to be the one record obtained in step (a) basedon a probability; and (c) repeating steps (a) and (b) for other recordsof the plurality of records such that the records of the reservoir forma sample of the plurality of records, wherein at least one recordobtained in step (a) may be used to reset more than one record of thereservoir for step (b).
 38. The method of claim 37, wherein theselectively resetting step (b) comprises the step of selectivelyresetting each record of the reservoir to be the one record obtained instep (a) based on a probability.
 39. The method of claim 37, wherein theselectively resetting step (b) comprises the step of selectivelyresetting one or more records of the reservoir to be the one recordobtained in step (a) with a probability based on a number of record(s)that have been obtained for step (a).
 40. The method of claim 37,wherein the selectively resetting step (b) comprises the step ofselectively resetting one or more records of the reservoir to be the onerecord obtained in step (a) based on a weight of the one record obtainedin step (a).
 41. The method of claim 37, wherein the selectivelyresetting step (b) comprises the step of selectively resetting one ormore records of the reservoir to be the one record obtained in step (a)with a probability based on a weight of the one record obtained in step(a) divided by a sum of weight(s) of record(s) that have been obtainedfor step (a).
 42. The method of claim 37, wherein the selectivelyresetting step (b) comprises the step of selectively resetting a randomrecord of the reservoir to be the one record obtained in step (a) basedon a probability a number of time(s) equal in number to the weight ofthe one record obtained in step (a).
 43. The method of claim 37, whereinthe selectively resetting step (b) comprises the step of selectivelyresetting a random record of the reservoir to be the one record obtainedin step (a) with a probability based on a number of records in thereservoir divided by a sum of record(s) evaluated for reset in thereservoir.
 44. The method of claim 37, wherein the plurality of recordsform a relation and wherein the sample is to be joined with records ofanother relation.
 45. A computer readable medium havingcomputer-executable instructions for performing a sequential sampling ofrecords in one pass, the computer-executable instructions for performingthe steps of: (a) obtaining one record from a plurality of records; (b)selectively outputting the one record obtained in step (a) one or moretimes based on a probability; and (c) repeating steps (a) and (b) forone or more other records of the plurality of records to form a sampleof the plurality of records, wherein at least one record obtained instep (a) may be output more than one time for step (b).
 46. The computerreadable medium of claim 45, wherein the plurality of records are arelation produced as a stream of records as a result of a query.
 47. Thecomputer readable medium of claim 45, wherein the plurality of recordsare materialized as a base relation in a database of the databasesystem.
 48. The computer readable medium of claim 45, wherein theselectively outputting step (b) comprises the steps of: (i) determininga random number based on the probability such that the random number isgreater than or equal to zero, and (ii) outputting the one record thedetermined random number of times.
 49. The computer readable medium ofclaim 48, wherein the determining step (b)(i) comprises the step ofdetermining the random number from a binomial distribution based on theprobability.
 50. The computer readable medium of claim 48, wherein thedetermining step (b)(i) comprises the step of determining the randomnumber such that the random number is less than or equal to a number ofrecord(s) remaining to be output for the sample.
 51. The computerreadable medium of claim 48, wherein the determining step (b)(i)comprises the step of determining the random number based on aprobability based on a number of record(s) of the plurality of recordsto be evaluated for output for step (b).
 52. The computer readablemedium of claim 45, wherein the selectively outputting step (b)comprises the step of selectively outputting the one record one or moretimes based on a weight specified for the one record.
 53. The computerreadable medium of claim 48, wherein the determining step (b)(i)comprises the step of determining the random number based on aprobability based on a weight of the one record divided by a sum ofweight(s) of record(s) of the plurality of records to be evaluated foroutput for step (b).
 54. The computer readable medium of claim 45,wherein the selectively outputting step (b) comprises the step ofselectively outputting the one record based on a probability a number oftime(s) equal in number to the weight of the one record.
 55. Thecomputer readable medium of claim 54, wherein the selectively outputtingstep comprises the step of outputting the one record with a probabilitybased on a number of record(s) remaining to be output for the sampledivided by a number of possible record(s) that may be output for step(b).
 56. The computer readable medium of claim 48, wherein thedetermining step (b)(i) comprises the step of determining the randomnumber such that the random number is less than or equal to a weight ofthe one record.
 57. The computer readable medium of claim 48, whereinthe determining step (b)(i) comprises the step of determining the randomnumber based on a probability based on a fraction of the plurality ofrecords.
 58. The computer readable medium of claim 45, wherein theplurality of records form a relation and wherein the sample is to bejoined with records of another relation.
 59. A computer readable mediumhaving computer-executable instructions for performing a sequentialsampling of records in one pass, the computer-executable instructionsfor performing the steps of: (a) obtaining one record from a pluralityof records; (b) selectively resetting one or more records of a reservoirto be the one record obtained in step (a) based on a probability; and(c) repeating steps (a) and (b) for other records of the plurality ofrecords such that the records of the reservoir form a sample of theplurality of records, wherein at least one record obtained in step (a)may be used to reset more than one record of the reservoir for step (b).60. The computer readable medium of claim 59, wherein the selectivelyresetting step (b) comprises the step of selectively resetting eachrecord of the reservoir to be the one record obtained in step (a) basedon a probability.
 61. The computer readable medium of claim 59, whereinthe selectively resetting step (b) comprises the step of selectivelyresetting one or more records of the reservoir to be the one recordobtained in step (a) with a probability based on a number of record(s)that have been obtained for step (a).
 62. The computer readable mediumof claim 59, wherein the selectively resetting step (b) comprises thestep of selectively resetting one or more records of the reservoir to bethe one record obtained in step (a) based on a weight of the one recordobtained in step (a).
 63. The computer readable medium of claim 59,wherein the selectively resetting step (b) comprises the step ofselectively resetting one or more records of the reservoir to be the onerecord obtained in step (a) with a probability based on a weight of theone record obtained in step (a) divided by a sum of weight(s) ofrecord(s) that have been obtained for step (a).
 64. The computerreadable medium of claim 59, wherein the selectively resetting step (b)comprises the step of selectively resetting a random record of thereservoir to be the one record obtained in step (a) based on aprobability a number of time(s) equal in number to the weight of the onerecord obtained in step (a).
 65. The computer readable medium of claim59, wherein the selectively resetting step (b) comprises the step ofselectively resetting a random record of the reservoir to be the onerecord obtained in step (a) with a probability based on a number ofrecords in the reservoir divided by a sum of record(s) evaluated forreset in the reservoir.
 66. The computer readable medium of claim 59,wherein the plurality of records form a relation and wherein the sampleis to be joined with records of another relation.
 67. A database systemfor performing a sequential sampling of records in one pass in thedatabase system, the database system comprising: means for performingthe step of (a) obtaining one record from a plurality of records; meansfor performing the step of (b) selectively outputting the one record oneor more times based on a probability; and means for repeating steps (a)and (b) for one or more other records of the plurality of records toform a sample of the plurality of records, wherein at least one recordobtained in step (a) may be output more than one time for step (b). 68.The database system of claim 67, wherein the plurality of records are arelation produced as a stream of records as a result of a query.
 69. Thedatabase system of claim 67, wherein the plurality of records arematerialized as a base relation in a database of the database system.70. The database system of claim 67, wherein the selectively outputtingstep (b) comprises the steps of: (i) determining a random number basedon the probability such that the random number is greater than or equalto zero, and (ii) outputting the one record the determined random numberof times.
 71. The database system of claim 70, wherein the determiningstep (b)(i) comprises the step of determining the random number from abinomial distribution based on the probability.
 72. The database systemof claim 70, wherein the determining step (b)(i) comprises the step ofdetermining the random number such that the random number is less thanor equal to a number of record(s) remaining to be output for the sample.73. The database system of claim 70, wherein the determining step (b)(i)comprises the step of determining the random number based on aprobability based on a number of record(s) of the plurality of recordsto be evaluated for output for step (b).
 74. The database system ofclaim 70, wherein the selectively outputting step (b) comprises the stepof selectively outputting the one record one or more times based on aweight specified for the one record.
 75. The database system of claim70, wherein the determining step (b)(i) comprises the step ofdetermining the random number based on a probability based on a weightof the one record divided by a sum of weight(s) of record(s) of theplurality of records to be evaluated for output for step (b).
 76. Thedatabase system of claim 67, wherein the selectively outputting step (b)comprises the step of selectively outputting the one record based on aprobability a number of time(s) equal in number to the weight of the onerecord.
 77. The database system of claim 76, wherein the selectivelyoutputting step comprises the step of outputting the one record with aprobability based on a number of record(s) remaining to be output forthe sample divided by a number of possible record(s) that may be outputfor step (b).
 78. The database system of claim 70, wherein thedetermining step (b)(i) comprises the step of determining the randomnumber such that the random number is less than or equal to a weight ofthe one record.
 79. The database system of claim 70, wherein thedetermining step (b)(i) comprises the step of determining the randomnumber based on a probability based on a fraction of the plurality ofrecords.
 80. The database system of claim 67, wherein the plurality ofrecords form a relation and wherein the sample is to be joined withrecords of another relation.
 81. A database system for performing asequential sampling of records in one pass in the database system, thedatabase system comprising: means for performing the step of (a)obtaining one record from a plurality of records; means for performingthe step of (b) selectively resetting one or more records of a reservoirto be the one record obtained in step (a) based on a probability; andmeans for performing the step of (c) repeating steps (a) and (b) forother records of the plurality of records such that the records of thereservoir form a sample of the plurality of records, wherein at leastone record obtained in step (a) may be used to reset more than onerecord of the reservoir for step (b).
 82. The database system of claim81, wherein the selectively resetting step (b) comprises the step ofselectively resetting each record of the reservoir to be the one recordobtained in step (a) based on a probability.
 83. The database system ofclaim 81, wherein the selectively resetting step (b) comprises the stepof selectively resetting one or more records of the reservoir to be theone record obtained in step (a) with a probability based on a number ofrecord(s) that have been obtained for step (a).
 84. The database systemof claim 81, wherein the selectively resetting step (b) comprises thestep of selectively resetting one or more records of the reservoir to bethe one record obtained in step (a) based on a weight of the one recordobtained in step (a).
 85. The database system of claim 81, wherein theselectively resetting step (b) comprises the step of selectivelyresetting one or more records of the reservoir to be the one recordobtained in step (a) with a probability based on a weight of the onerecord obtained in step (a) divided by a sum of weight(s) of record(s)that have been obtained for step (a).
 86. The database system of claim81, wherein the selectively resetting step (b) comprises the step ofselectively resetting a random record of the reservoir to be the onerecord obtained in step (a) based on a probability a number of time(s)equal in number to the weight of the one record obtained in step (a).87. The database system of claim 81, wherein the selectively resettingstep (b) comprises the step of selectively resetting a random record ofthe reservoir to be the one record obtained in step (a) with aprobability based on a number of records in the reservoir divided by asum of record(s) evaluated for reset in the reservoir.
 88. The databasesystem of claim 81, wherein the plurality of records form a relation andwherein the sample is to be joined with records of another relation.